Using text data instead of SIC codes to tag innovative firms and classify industrial activities

The paper uses text mining and semantic algorithms to tag innovative firms and offer an alternative perspective to classify industrial activities. Instead of referring to firms’ standard industrial classification codes, we gather information from companies’ websites and corporate purposes, extract keywords and generate tags concerning firms’ activities, specializations, and competences. The evidence is interesting because it allows us to understand ‘what firms do’ in a more penetrating and up-to-date way than standard industrial classification codes permit. Moreover, by matching firms’ keywords, we can explore the degree of closeness between the firms under observation, a measure from which researchers can derive industrial proximity. The analysis can provide policymakers with a detailed and comprehensive picture of the innovative trajectories underlying the industrial structure in a geographic area.


Introduction
Innovative firms can seize opportunities created through technological progress, and generate demand for skilled labour, higher wages, and productivity gains [1]. Given the role they play, it is essential to correctly identify innovative firms and their industrial activities.
An increasing number of scholars argue that it is not sufficient to refer to the standard industrial classification (SIC) codes to understand firms' activities and appreciate their degree of innovativeness [2,3]. SIC codes are a consolidated taxonomy and are currently used for statistical surveys of economic activities, but they show many limitations, as widely discussed in the literature.
The information attached to the code itself is scarce: it is limited to a very short textual definition (e.g., in Europe NACE code M72.1.1 is equivalent to 'Research and experimental development on biotechnology'). The codes are limited with respect to the existing variety of business activities. Although the current SIC contains more than a thousand codes, they are not (and probably never will be) enough to describe the abundance of firms' activities. Such codes, which are obsolete and not up to date with technological and market trends, are a 'tight' fit for innovative firms. Sometimes firms opt for residual and broader codes, for example, in Europe, the NACE code M74.9, 'Other professional, scientific and technical activities': these definitions are too broad and vague to be informative on firms' activities. Finally, we know that firms innovate, transform, and renew continuously: the code chosen yesterday can only partially reflect what the firm does today or will do tomorrow [4]. Instead of referring to SIC codes, we gather information from firms' websites and corporate purposes, extract keywords and generate tags concerning firms' activities, specializations, and competencies. Our methodology assumes that firms use the same (or similar) words to identify and describe the same (or similar) activities, regardless of the SIC codes chosen during the registration phase.
Using text data as input to research is not new in the economic literature [5]. The information encoded in digital texts represents a useful complement to more standard and structured data, and this is testified by the remarkable growth of economic research in recent years that uses texts as data. In finance, for example, text from newspapers and social media is used to forecast stock price movements [6]. In microeconomics, text from advertisements and product reviews allows for the study of consumer decision drivers [7]. In industrial economics, text describing products is used to propose alternative industry classifications [8].
Analysing textual data is a challenging activity when the information is embedded 'somewhere' within large masses of unstructured data. Using text data means starting a step backwards. Once the suitable data source has been identified, the text needs to be manipulated and elaborated into meaningful patterns of understanding and insightful perspectives.
Our paper uses text mining and semantic algorithms to tag innovative firms and offer an alternative perspective to classify industrial activities. The evidence is interesting because it allows us to understand 'what firms do' in a more penetrating and up-to-date way than SIC codes permit. Moreover, by matching firms' keywords, we can explore the degree of interconnection between the firms under observation, a measure from which researchers can derive industrial proximity. The paper is organized as follows. The next sub-section presents the relevant literature: starting from the limitations of the SIC codes, the focus is on new approaches and methods based on text data to describe the content of firms' innovative activities. The Data section illustrates the dataset: the analysis has been carried out on a sample of 583 innovative firms active in Chieti and Pescara, in Abruzzo, Italy. The choice of this geographic area is motivated by the presence of a large and diversified production system, characterised by numerous and remarkable innovative firms. The following section illustrates the methodology. A concise but comprehensive description of the methodological and operating steps is provided, omitting some technical passages and an in-depth discussion of the text mining and semantic algorithms. Even if these algorithms constitute the central part of the investigation, an exhaustive discussion does not seem justified, for two reasons. First, in computational science such algorithms have become a standard with countless applications in many fields of study. Second, the algorithms adopted do not show any special advantage compared to others, or at least this is not what we intend to argue in this paper. The Results section emphasizes the abundance of the information that can be obtained from the collection and elaboration of text data.
It also discusses the advantages of organizing the labels into different levels or categories of interest, and the novelty of the different views that can be proposed to appreciate the degree of proximity between firms. In fact, once the keywords have been created and sorted into categories, the analysis turns to the resulting adjacency matrices, which bring firms closer to one another based on the number of co-occurrences of categories and keywords between them. For the sake of simplicity, instead of focusing on networks of firms (583 nodes), which would be too large to investigate in detail in this paper, we aggregate firms by sectors (36 nodes), relating them based on firms' common specializations and competencies. The last section concludes, emphasizing limitations and highlighting policy implications.

Background
In the modern world, it is crucial to identify innovative firms correctly and promptly and to classify their industrial activities. However, this information is not easy to obtain. The well-known starting point for understanding what firms do is the SIC system. Using SIC codes is a common practice, even though SIC codes can be uninformative or misleading. The SIC system has many limitations. The descriptions adopted are too concise. The codes, though numerous, are insufficient to represent the variety of existing economic activities [9]. The SIC codes are often obsolete and out of step with technological evolution [3]. Sometimes, to avoid being 'confined' to specific activities or areas, firms opt for broader residual codes. Finally, firms are constantly changing and renewing their businesses: the code chosen yesterday may only partially reflect what the firm does today [10].
In addition to the criticalities mentioned above, there is a legitimate question. In the era of big data, in which firms make available a lot of information that can be collected and processed to classify in a detailed and updated way their industrial activities, why not attempt to make some use of it? Many recent contributions propose original methods starting from text data to classify firms and industries. Below we propose a brief review of the key literature.
[2] develop a sector-product approach and employ text mining to enrich the description of firms' activities in the ICT and digital sector in the United Kingdom. The authors use raw text data and contextual information gathered from websites and news feeds. Interestingly, they affirm that text mining might provide further detail over SIC codes, which tend to lag far behind technological evolution. [11] propose using the web and new methods of data extraction to derive metadata useful for industry classification, looking at a regional case in the Northeast of England. The exercise proposed by the authors is a tool to identify specific aggregates of industrial activities in a geographic area. The discussion starts by highlighting the limitations of SIC codes and is followed by the proposed methodology, based on web-based data collection, pre-processing and analysis, and reporting of clusters. [12] conduct a bottom-up study with the aim of overcoming the limitations of industry classifications to study the composition of the economy. Applying machine learning and graph theory techniques, the authors analyze company descriptions extracted from company websites and generate alternative taxonomies on the basis of which they define industries as 'communities within word networks'. Sometimes the researchers' intentions do not stop at describing firms' activities in a new and original way and go further to explore their innovative activities. [13] formulate a novel approach to estimate firms' innovation activity based on texts on corporate websites. They use an automated web scraper to harvest text from websites, then extract semantic topics with a self-learning, generative topic-modelling approach, and analyze these topics using a neural network method to assess each firm's level of innovation.
Increasingly, existing databases are being used to exploit the huge amount of structured and constantly updated data. They range from large proprietary databases to open repositories: in the former, textual data, such as information on the companies' corporate mission, registered patents and economic and financial news published in the press, are largely available, while in the latter, there are descriptive summaries of firms' activities and products/services sorted by keywords. [14] analyze the unstructured texts that describe firms' businesses using the statistical learning technique of topic modeling and construct a proximity measure based on the Latent Dirichlet Allocation algorithm, by which they represent each firm's textual description as a probabilistic distribution over a set of underlying topics. [15] perform text mining on Crunchbase to work on green-tech firms in San Francisco, New York, and London. Using metadata, the authors classify firms' industrial activities and underlying specializations, building links for technological and market complementarities, identifying specific firms' aggregates and emerging industrial clusters. [16] measure innovative digital firms' specializations and competencies based on the degree of digital technologies in the products and services supplied. The method makes it possible to overcome the limitations of defining industrial specializations in digital industries through SIC codes and to capture innovative firms' specializations at the metropolitan level. [17] propose a classification of specializations along the automotive supply chain in Italy based on the analysis of the descriptive texts of the activities provided in the process of registering with the Chamber of Commerce. The authors implement a multidimensional analysis of words to identify clusters of specializations, and a similarity analysis of words to provide indications on the clustering of specializations as they are described by companies.
Sometimes it has been useful to look at information sources other than SIC codes to classify economic activities. In this vein are the studies by [8,18,19]. [8] collect business descriptions from thousands of firm 10-K product descriptions using web crawling algorithms and process the text to propose new industry classifications. The authors can study how firms differ from their competitors using new time-varying measures of product similarity, which allows for the generation of a new set of industries in which firms can have their own distinct set of competitors.
Once firms' activities have been classified, scholars start to address further research questions. One of these concerns the conditions and modalities by which firms exchange knowledge, promoting its diffusion and re-generation. Such an exchange becomes a knowledge flow, something not easily traceable. In this case, economists are inclined to use proxies, assuming that these flows have a higher chance to occur between firms that are 'closer' in the industry space. In the first place, economists use the SIC system to assume some exchange of knowledge if two firms share the same two-, three-, four-, or five-digit codes [20,21]. However, such measures are still discrete, and the level of granularity is constrained by the adopted classification system. Moreover, whether such measures are indicative of industrial proximity can be questioned [22]. The most recent attempts to define and classify industrial activities based on alternative data sources all go hand in hand with the next step of operationalizing proximity between firms. Accordingly, albeit not in depth, this article proposes to explore industrial proximity based on the resulting matrices of adjacencies between firms.

Data
The dataset used for our analysis merges information from different databases, including startup.registroimprese.it by Unioncamere and Analisi Informatizzata delle Aziende (Aida) by Bureau Van Dijk: the former is the official database of the Chambers of Commerce that collects Italian startups and innovative SMEs; the latter includes comprehensive information concerning corporate purposes and financial indicators on companies in Italy.
The initial perimeter includes several hundred firms based in the provinces of Chieti and Pescara, in Abruzzo. Abruzzo has a production system of quality and excellence, large and diversified. Abruzzo is seventh in Italy for industrial expertise and for the impact of exports on GDP, sixth for trade surplus and second for exchange value [23]. The province of Chieti is specialized in the automotive sector, with large firms operating in the production of vehicles and components engineering. In the province of Pescara there is one of the most competitive supply chains in sanitary products in paper and cotton, from machinery to the production of materials, up to packaging [23]. In the provinces, there are also the paper and paperboard industry, activities related to the mining industry, professional, scientific, and technical activities, and advertising agencies. Other important industries are pharmaceuticals and related industries [23].
To generate and assign keywords to firms, we use different sources of text data such as the company website, the corporate purpose, and synthetic descriptions of the firm's economic activity. Among the data sources, we emphasize the importance of the company website, which explains firms' innovative activities well [24]. The corporate purpose does not help when it defines firms' core businesses too vaguely, which might happen when there is an interest in 'leaving open' possible future paths of development and activities. In such a case, we privilege the website, which offers the more up-to-date picture of the business activity. Otherwise, the corporate purpose can be very informative, as innovative start-ups and SMEs confirm. This is explained by the strict assessment carried out in Italy by the local Chambers of Commerce before registration. In all those cases in which the corporate purpose is formulated in a clear and defined way, it represents a useful source of information, which enriches and complements the website.
We employed a specific procedure to single out a limited number of innovative firms on which to run our text mining and semantic algorithms. Even though this procedure is beyond the scope of the present paper, we summarize the reasoning below. The presence of specific keywords (such as 'innovation' and 'technology' in their different declinations) in the text describing the firms helped us to capture some 'essential traits' of innovation that, together with the existence of further terms on the firms' positioning on the market and/or their ability to export (e.g., 'leadership', 'export' and 'international' in their different declinations), allowed us to narrow the perimeter of the analysis. Also, a scoring of confidence and presence attached to a list of keywords related to recent technological advancements (e.g., 'Digital technologies', 'Artificial intelligence', 'Industrial automation', 'Robotics', 'Augmented reality', 'Cybersecurity', 'Edge computing', and so on) was used to validate the firms' degree of innovation and circumscribe the target to investigate. Following the selection process, 583 firms were identified: 56% are in the province of Pescara and 44% in the province of Chieti. More specifically, the observed firms are located in three main areas. The first ranges from the municipality of Pescara to that of Chieti, including neighboring municipalities. The second goes from Ortona to the industrial zone of Val di Sangro, through the municipality of Lanciano. The third area, much smaller, runs through the municipalities of Vasto and San Salvo (Fig 1).
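The screening logic summarized above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the paper: the word stems, the short list of technology terms and the selection rule (at least one innovation-related word plus at least one market or technology signal) are all assumptions made for illustration.

```python
import re

# Hypothetical stems and terms; the full keyword lists are not reported in the paper.
INNOVATION_STEMS = ("innovat", "technolog")
MARKET_STEMS = ("leader", "export", "internation")
TECH_TERMS = ("artificial intelligence", "robotics", "cybersecurity", "edge computing")


def count_hits(text, stems):
    """Count the words in `text` that start with any of the given stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word.startswith(stems))


def screen_firm(text):
    """Return (selected, details) for a single firm description."""
    lowered = text.lower()
    innovation = count_hits(text, INNOVATION_STEMS)
    market = count_hits(text, MARKET_STEMS)
    tech = sum(1 for term in TECH_TERMS if term in lowered)
    # Assumed rule: one innovation word plus one market or technology signal.
    selected = innovation > 0 and (market > 0 or tech > 0)
    return selected, {"innovation": innovation, "market": market, "tech": tech}
```

Matching on stems is a simple way to catch a keyword 'in its different declinations' (for example, 'innovative' and 'innovation' both start with 'innovat'); a real pipeline would more likely rely on lemmatization and on confidence scores returned by the text analysis engine.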
Many firms are active in 'Knowledge Intensive Activities', also called KIAs [25], which can be divided into manufacturing (science-based industries and specialized machinery and devices), services (software, consulting and engineering services, architecture and R&D) and art, culture and creative activities. Table 1 shows the frequency of the observed firms by Ateco 2007, which is the classification of economic activities adopted in Italy. The Ateco 2007 classification is the national version of the European nomenclature, Nace Rev. 2. Ateco 2007 was set out and approved by a Steering Committee that, in addition to Istat, includes the Ministries concerned, the bodies which manage the main administrative data sources on firms (Agenzia delle Entrate, Chambers of Commerce, social security institutions) and the main business associations.
Since Ateco 2007 shows the same limitations as any SIC system, we propose an alternative approach to classify observed firms and industrial activities.

Methodology
The proposed analysis uses algorithms of text mining and semantics that enable a streamlined processing of descriptive data. As mentioned, the reason for this massive text analysis lies in the need to tag in an informative and updated way the activities carried out by innovative firms, trying to capture their specializations and competencies.
We processed multiple types of information sources such as corporate .html pages (through web crawling, where permitted) and .txt files on corporate purposes, together with other firms' descriptions if available. This was the first step towards obtaining a multilabel classification.
We made use of a 'general purpose' natural language recognition model based on machine learning algorithms pre-trained on different knowledge bases (such as Wordnet, Wikipedia, Dbpedia, and thousands of textbooks). We performed pre-processing procedures typical of text mining (e.g., lemmatization, stemming, stop-word removal, and so on). Additional modules were used for spell-checking and language detection. We employed mixed models that draw on multiple existing and updated lists/taxonomies and leverage application programming interfaces (APIs) to large libraries offered by software houses specializing in text analysis.
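For readers less familiar with these pre-processing steps, the following toy Python sketch shows tokenization, stop-word removal and a very crude form of stemming. The stop-word list and the suffix rules are illustrative assumptions; they are not the modules used in our pipeline, which relies on established text analysis libraries.

```python
import re

# Illustrative stop-word list and suffix rules (assumptions, not the paper's resources).
STOP_WORDS = {"the", "and", "of", "for", "in", "a", "an", "to", "we", "our", "with"}
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer or lemmatizer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]
```

The point of the sketch is only that different surface forms ('machines', 'machine') collapse onto a common root before keywords are extracted, so that firms describing the same activity with slightly different words can still be matched.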
Afterwards, we switched to a 'specific model', calibrated to our research goal: what innovative firms do. We performed the labeling phase, that is assigning tags or labels to the observed firms, also by means of semantic understanding of the text. We carried out a first calibration based on some basic and easily understandable rules: for example, using corporate purposes to describe business activities requires the removal of standard ancillary activities (e.g., 'Ancillary to its principal business, the Company may also purchase, own, manage, use, update and develop, directly or indirectly, trademarks, patents and know-how concerning electronic tolling systems and related or connected activities').
We performed a semi-automatic check on tags to assess the quality of the generated output and added rules to reduce the noise acquired during the extraction phase. Given our interest in digital technologies, we carried out a further calibration focusing on ICT specializations. We proceeded with a normalization of the dataset. We used pre-defined algorithms to obtain a multi-label classification and assign each label to the categories or level of interest. We exploited taxonomies updated with technological evolution. We employed two families of algorithms: extraction algorithms, which aim to identify the keywords characterizing the business activity, collected in a category called 'entities', and classification algorithms, which assign firms to pre-established categories. The former algorithms allow for the profiling of firms with specific details that firms themselves offer in the description of their own businesses. The classification algorithms create 'redundancy' by assigning firms to categories, which is fundamental to ensure the matching of firms.
Once extracted, after an accurate work of revision and standardization, the keywords are organized by category or level. A key aspect of this exercise concerns the choice of the taxonomies: these must be as broad and up to date as possible. We started with a set of taxonomies for classifying specializations and competencies using our previous knowledge base and external sources. The latter consist of expert-driven (where taxonomies are based on expert input) and data-driven classifications (where taxonomies are formulated using machine learning algorithms following the processing of large volumes of data).
For the first level (sectors) we referred to updated taxonomies adopted by open databases on innovative and high-tech sectors that collect hundreds of thousands of companies active in innovative sectors, including Crunchbase, Dealroom and AngelList. The list of sectors is provided in Table 2.
With regards to the other two categories (specializations and competencies), we relied on the one hand on taxonomies employed by software houses specializing in text analysis (such as, for example, Text Razor, Aylien, Dialogflow) and, on the other, on taxonomies that, although built in different contexts and for different purposes, are useful references. Among these last ones, we referred to vast documentation, including [26][27][28], and to existing taxonomies such as the Occupational Information Network [29], and the European Skills, Competences, Qualifications and Occupations [30]. The specialization category was created by exclusion: once the words were assigned to the categories of sectors and competencies (broadly defined), the more general or common words were attached to the category 'entities' (and, then, used to tag firms) or removed if not informative on firms' activities. After these steps, a large basket of words describing the activities carried out by the observed firms has been created. As known, competencies are an umbrella notion, very difficult to circumscribe [31]. In this paper, we define competence as the ability to apply knowledge and skills to achieve results, and for such a reason we opted to refer primarily to scientific disciplines to account for them. Therefore, within the sector in which the firms operate, it is possible to go further and characterize them based on the specializations that distinguish them within the industry on one side and on the disciplinary competencies possessed by the people working in the firms on the other. We assigned the 583 firms in target to more than a thousand categories: sectors (respectively, 32 unique words from corporate purposes and 36 from company websites), specializations (respectively, 310 unique words from corporate purposes and 931 unique words from websites), and competencies (respectively, 74 unique words from corporate purposes and 154 unique words from websites). 
Moreover, thousands of keywords in the category 'entities' were used to tag firms (respectively, 887 unique words from corporate purposes and 4614 unique words from websites).
Firms became strings of tags attached to different categories: sectors, specializations, competencies, and entities. Overall, more than 28 thousand keywords have been generated and used in the database. Afterwards, it was possible to match firms to each other by means of common specializations (e.g., 'Supply chain management') and/or competencies (e.g., 'Statistics').
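The matching step can be sketched as follows, under the assumption that each firm is represented by a set of tags and that the link between two firms is weighted by the number of tags they share. The firm identifiers and tag sets below are invented for illustration.

```python
from itertools import combinations

# Toy data: firm identifiers and tags are made up for illustration.
firm_tags = {
    "firm_a": {"Supply chain management", "Statistics", "Software"},
    "firm_b": {"Supply chain management", "Statistics", "Robotics"},
    "firm_c": {"Statistics", "Design"},
    "firm_d": {"Packaging"},
}


def cooccurrence_edges(tags_by_firm):
    """Return {(firm_i, firm_j): shared tag count} for pairs sharing at least one tag."""
    edges = {}
    for a, b in combinations(sorted(tags_by_firm), 2):
        shared = len(tags_by_firm[a] & tags_by_firm[b])
        if shared:
            edges[(a, b)] = shared
    return edges


edges = cooccurrence_edges(firm_tags)
```

In this toy example `firm_a` and `firm_b` share two tags and are therefore linked more strongly than either is to `firm_c`, while `firm_d` remains isolated. The resulting dictionary is a sparse representation of a weighted, undirected adjacency matrix.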
In this way, we moved from working on textual data to relational data. Once the keywords have been created and sorted into categories, the analysis turns to the resulting adjacency matrices, which bring firms closer to one another based on the number of categories and keywords that co-occur between them. Connecting firms through the co-occurrence of keywords means relying on well-known graph theory to investigate networks, nodes, and edges. As anticipated above, instead of focusing on networks of firms (583 nodes), too large to investigate in depth, we aggregate firms by sectors (36 nodes), relating them based on firms' common specializations and/or competencies. Nonetheless, many other forms of aggregation are feasible, as we will show. Before presenting the results, it is appropriate to point out some disadvantages of using network analysis. These disadvantages apply in general and, a fortiori, were encountered in our analysis. Firstly, the collection of textual data for network analysis requires careful and meticulous filtering and cleaning to guarantee the accuracy of the terms selected for the construction of the graphs. Secondly, network analysis does not always allow us to propose reliable comparisons between different graphs; in fact, different networks often represent phenomena with different structures and, therefore, are not comparable. Consequently, the analysis proposed below will focus on the most macroscopic evidence regarding the structure underlying the different graphs and the most significant patterns.
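One possible way to aggregate firm-level co-occurrences into a sector-level network is sketched below. The sketch assumes that each firm carries a single sector label and that pairs of firms within the same sector are ignored; both are simplifying assumptions for illustration (in our data a firm can be tagged with more than one sector), and the firms and tags are invented.

```python
from collections import defaultdict
from itertools import combinations

# Toy data: one sector label per firm is a simplifying assumption.
firms = {
    "firm_a": {"sector": "Software", "specializations": {"Design", "Data analysis"}},
    "firm_b": {"sector": "Software", "specializations": {"Data analysis", "Cloud"}},
    "firm_c": {"sector": "Research", "specializations": {"Data analysis", "Design"}},
    "firm_d": {"sector": "Consulting activities", "specializations": {"Design"}},
}


def sector_edges(firms):
    """Sum shared specializations over all pairs of firms in different sectors."""
    weights = defaultdict(int)
    for a, b in combinations(sorted(firms), 2):
        sector_a, sector_b = firms[a]["sector"], firms[b]["sector"]
        if sector_a == sector_b:
            continue  # assumption: no self-loops at the sector level
        shared = len(firms[a]["specializations"] & firms[b]["specializations"])
        if shared:
            weights[tuple(sorted((sector_a, sector_b)))] += shared
    return dict(weights)


sector_network = sector_edges(firms)
```

Replacing `"specializations"` with a hypothetical `"competencies"` field would give the competence-based view of the same sectors, which is how alternative aggregations of the same tags can be obtained.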

Results
Instead of referring to the SIC system, we gathered information from companies' websites and corporate purposes and generated tags concerning firms' activities. Using such tags, we built a new dataset, which represented our starting point: the firms become sequences of tags that describe their activities or sectors, specializations, and competencies.
The resulting dataset provides a mapping of industrial activities with useful insights into firms' profiles. The following sectors emerge as relevant: 'Information and communication' (13.03% of the total keywords assigned to the category sector), 'Software' (10.83%), 'Consulting activities' (8.17%), 'Internet & e-commerce' (7.84%), 'Plants and equipment' (6.85%), 'Research' (5.93%). Table 2 lists all sectors and their relative frequency. The most common specializations are 'Design' (intended as the design phase of new products and/or services, 3.36% of total keywords assigned to the category specialization), 'Digital [...]. Table 3 lists the top 30 specializations and their relevance in the database.
Co-occurrences of tags assigned to firms imply some proximity: two or more firms are close to each other if they are active in the same sector, are specialized in the same areas, and share some competencies.
As anticipated, most attempts directed to classify firms' activities aim also at capturing proximity between them. The most used measure uses the hierarchy of the SIC codes: the lower the class two firms share in the hierarchy, the more similar they are thought to be. According to this basic reasoning, firms in the same 5-digit class are more related than firms that only share the same 3-digit class. Fig 2 shows the network resulting from Ateco 2007, by which two firms are connected by a link of weight 5 if they share the same 5-digit class, by a link of weight 4 if they share the same 4-digit class, by a link of weight 3 if they share the same 3-digit class, by a link of weight 2 if they share the same 2-digit class. This network is undirected and weighted: this means that the higher the n-digit class, the heavier the edge between nodes.
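The weighting rule just described can be written compactly. The sketch below assumes that codes are given as strings such as '62.01.0', from which only the digits are compared; the exact code format is an assumption made for illustration.

```python
def shared_class_weight(code_a, code_b):
    """Return 5, 4, 3 or 2 for the longest shared leading-digit class, else 0."""
    digits_a = "".join(ch for ch in code_a if ch.isdigit())
    digits_b = "".join(ch for ch in code_b if ch.isdigit())
    for n in (5, 4, 3, 2):
        if len(digits_a) >= n and len(digits_b) >= n and digits_a[:n] == digits_b[:n]:
            return n
    return 0  # no shared 2-digit class: the two firms are not linked
```

For example, two firms coded '62.01.0' and '62.02.0' share only the first three digits and receive a link of weight 3, whereas firms coded '62.01.0' and '63.11.1' share no 2-digit class and remain unconnected.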
The network of 583 firms is mostly fragmented, with several clusters apparently disconnected from each other. The average degree (that is, the average number of links per node) is 48.69, which would seem to indicate a rather high level of interaction between the nodes of the network but which, on closer examination, can be explained by the high level of interconnection between the nodes belonging to the software sector: 133 firms share the 5-digit class within economic activity '62 - Software'. The graph density (the ratio of the number of edges to the number of possible edges) is 0.09, which indicates a rather limited level of interconnection. As the visualization shows, the level of dispersion of the network is high, with a modularity (the measure of the strength of division of a network into groups or clusters; networks with high modularity have dense connections between the nodes within groups but sparse connections between nodes in different clusters) of 0.60. Is it correct to assume the presence of links between firms because of their closeness within Ateco 2007? In our opinion, the approximation is too crude: a flow between two firms is assumed only because they are near each other in the SIC system, but nothing can be presumed about the possible content of the exchange. While isolating such content is a complex (if not impossible) task, a deeper investigation of firms' underlying industrial activities, specializations and competencies would allow us to deduce something more about the emerging links.
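To make the three measures concrete, the sketch below computes them on a toy graph made of two triangles joined by a single edge. It uses the standard unweighted definitions (average degree 2m/n, density 2m/(n(n-1)), and modularity as the sum over groups of the share of internal edges minus the squared share of total degree) and takes the partition as given. The graphs in this paper are weighted and the groups are detected by the network software, so this is an illustration of the definitions rather than a reproduction of our computation.

```python
# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = [{0, 1, 2}, {3, 4, 5}]  # partition assumed known in advance


def graph_summary(edges, communities):
    """Return (average degree, density, modularity) for an unweighted graph."""
    nodes = {node for edge in edges for node in edge}  # isolated nodes are ignored
    n, m = len(nodes), len(edges)
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    average_degree = 2 * m / n
    density = 2 * m / (n * (n - 1))
    modularity = 0.0
    for group in communities:
        internal = sum(1 for a, b in edges if a in group and b in group)
        group_degree = sum(degree[node] for node in group)
        modularity += internal / m - (group_degree / (2 * m)) ** 2
    return average_degree, density, modularity


average_degree, density, modularity = graph_summary(edges, communities)
```

Note that density equals the average degree divided by n - 1, which is why a network can show a seemingly high average degree and still be sparse when the number of nodes is large.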
The co-occurrence of tags across levels (sectors, specializations and competencies) and within the category 'entities' generates a highly interconnected network (Fig 3). This network, like all the following graphs, is undirected and weighted: the higher the number of co-occurrences, the heavier the edge between nodes.
The overall number of links is 90972, which is 6.5 times larger than in the previous network. The average degree is 312.62, which suggests a high level of interconnection between the nodes. The graph density, at 0.54, is significantly higher than in the previous graph. As Fig 3 shows, the level of fragmentation of the network is lower, with a modularity of 0.20. While some clusters of firms are identifiable in terms of industry codes, such as the blue cluster (ATECO 62) at the bottom right, the green cluster (ATECO 72) above, and the red cluster (ATECO 25) at the bottom left, at the same time it is possible to discern the numerous mixtures between firms belonging to such ATECO codes and firms of different industries. For example, using text data ATECO 25 firms become close to ATECO 22 firms (orange nodes) and to other residual classes (nodes in grey, which refer to ATECO codes with a frequency of less than 3%). ATECO 72 firms at the top (in green) are close to ATECO 70 (light blue), ATECO 74 (light green) and ATECO 64 (violet) firms. The most relevant cluster in terms of nodes (blue, ATECO 62) is divided into three subclusters: the bottom right cluster (more colorful and wide-ranging in terms of ATECO 2-digit codes), the middle cluster with ATECO 73 firms (brown nodes) close to codes 72 and 74 (dark green and light green, respectively) and, finally, the bottom left cluster (mixed with ATECO 26 firms in bright pink).
Once we begin to aggregate firms by sector or specializations, we attain different and, often, more insightful views. Below, we provide some evidence that helps to understand which are the most central sectors, based on specializations and competencies respectively, and how close these are in Chieti and Pescara.
The first representation helps to appreciate the interrelationships between firms. Firms are aggregated per sector of activity, so as to obtain a smaller number of nodes and a more intelligible graph. The links in the network are based on the co-occurrence of specializations between firms, aggregated by sector. The graph shows a rather high average degree (19.94), an average weighted degree (the average, over all nodes, of the sum of the weights of each node's edges) of 9923.56, a network diameter (the maximum distance between any pair of nodes in the graph) of 3, and a graph density of 0.57. The average path length (the average number of steps along the shortest paths for all possible pairs of network nodes) is 1.43, while the modularity is 0.18 and the average clustering coefficient (the average degree to which nodes tend to cluster together) is 0.79 (Table 5).
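The definitions given in parentheses above can be made concrete with a small sketch. The graph below is a hypothetical four-node example, not the sector network of Table 5. Two assumptions are made: the graph is connected, and distances are counted in steps (hops), ignoring edge weights.

```python
from collections import deque
from itertools import combinations

# Hypothetical weighted, undirected graph (illustrative only).
weighted_edges = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2, ("C", "D"): 5}

# Adjacency map: node -> {neighbour: weight}.
adj = {}
for (u, v), w in weighted_edges.items():
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

n = len(adj)
m = len(weighted_edges)

avg_degree = sum(len(nbrs) for nbrs in adj.values()) / n
avg_weighted_degree = sum(sum(nbrs.values()) for nbrs in adj.values()) / n
density = 2 * m / (n * (n - 1))


def hop_distances(source):
    """Breadth-first search: number of steps from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Assumption: the graph is connected, so every pair has a finite distance.
pair_distances = [hop_distances(u)[v] for u, v in combinations(sorted(adj), 2)]
diameter = max(pair_distances)
avg_path_length = sum(pair_distances) / len(pair_distances)


def local_clustering(node):
    """Share of pairs of neighbours of `node` that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


avg_clustering = sum(local_clustering(v) for v in adj) / n

print(avg_degree, avg_weighted_degree, diameter,
      round(avg_path_length, 2), round(density, 2), round(avg_clustering, 2))
```

Network analysis tools may differ in how they treat isolated nodes, disconnected components and edge weights, so values computed this way are comparable to published ones only when the same conventions are used.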
Based on common specializations, the sectors 'Information and Communication technology' and 'Software' are at the center of the innovative economic system (Fig 4): in terms of weighted degree (that is, the sum of the weights of a node's edges), the former reaches 53958 and the latter 48092 (Table 6). Other sectors follow with lower values, such as 'Internet & e-commerce' (37620), 'Consulting activities' (25266) and 'Plants and equipment' (22038).
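One possible way to obtain sector-level links and weighted degrees from firm-level tags is sketched below. The sector labels, firms and specializations are hypothetical, and the aggregation rule (summing, over all pairs of firms in different sectors, the number of specializations they share) is an assumption made for illustration; the paper does not spell out the exact rule.

```python
from collections import Counter
from itertools import combinations

# Hypothetical firms: (sector, set of specializations). Illustrative only.
firms = [
    ("Software", {"cloud computing", "data analytics"}),
    ("Software", {"data analytics", "cybersecurity"}),
    ("ICT", {"cloud computing", "cybersecurity"}),
    ("Consulting", {"data analytics"}),
]

# Assumption: the weight of a link between two sectors is the total number of
# specializations shared by pairs of firms belonging to those two sectors.
sector_link = Counter()
for (s1, tags1), (s2, tags2) in combinations(firms, 2):
    shared = len(tags1 & tags2)
    if s1 != s2 and shared:
        sector_link[tuple(sorted((s1, s2)))] += shared

# Weighted degree of a sector: sum of the weights of its links.
weighted_degree = Counter()
for (s1, s2), w in sector_link.items():
    weighted_degree[s1] += w
    weighted_degree[s2] += w

for sector, w in weighted_degree.most_common():
    print(sector, w)
```

Ranking sectors by weighted degree in this way mirrors the reading of Table 6: the sectors at the top are those that share the most specializations with the rest of the system.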
Even though we know that all these sectors can be considered interrelated based on similar specializations, we still want to establish how close these sectors are to each other. Such evidence becomes measurable through the metrics of the graphs. We report some measures of relevant relationships that can be used as benchmarks: the weight of the link between the firms active in 'Software' and those in 'Information and Communication technology' is equal to 12300, between firms in 'Information and Communication technology' and those in 'Internet [...]
In some respects, this is easy to understand since they are fundamental sectors, supporting the economic system (Table 8). Here too, we can identify the sectors to which both 'Research' and 'Consulting activities' offer their specific knowledge. In this case, we detect the links between sectors based on the underlying competencies.
The main target sectors are 'Energy, environment and utilities', 'Plants and equipment', 'Professional, scientific, and technical activities' and 'Transport, logistics and storage'. We can focus on some links that can be taken as reference. The weight of the link between the firms active in 'Software' and those active in 'Information and Communication technology' is 3450; between 'Information and Communication technology' and 'Internet & e-commerce' it is 2720; between 'Software' and 'Electronics', 1270; between 'Consulting' and 'Information and Communication technology', 1248; between 'Consulting' and 'Professional, scientific and technical activities', 1176; between 'Plants and equipment' and 'Software', 1024; and between 'Research' and 'Information and Communication technology', 972. Even in this case, we are able not only to measure the weight of the links, but also to identify the driving competencies behind such links: 'Computer engineering', 'Mechanical engineering', 'Chemistry' and 'Business disciplines'.
The reasoning above can be replicated for a network of specializations. For visualization purposes, we restrict the network to the top 130 specializations (Table 9). In this case, specializations are brought close to each other if they are accompanied, in one or more firms, by the same competencies: the higher the number of firms sharing the same specializations and competencies, the heavier the edge between specializations in the graph.
Data management and digital technologies play a pivotal role in terms of the underlying specific competencies in Chieti and Pescara (Fig 6).

Conclusions
This paper proposed an original method to tag innovative firms and classify industrial activities. Instead of referring to SIC codes, we gathered information from companies' websites and corporate purposes, extracted keywords and generated tags concerning firms' activities. Firms thus became sequences of tags ordered on different levels or categories: sectors, specializations, and competencies.
Why transform firms' descriptive texts into keywords? There are at least two reasons. The first is that investigating innovative activities is a complicated task, especially in modern times, when technological evolution is rapid and innovation is incessant. Starting from a large and updated information base, even a fluid and dynamic one, is therefore of great help. The second is that the keywords can be used to link firms and sectors (groups of firms) according to the researcher's interest, for example using underlying specializations or competencies. Also, as seen, firms' specializations can be grouped based on underlying competencies.
Our paper used text mining and semantic algorithms to tag innovative firms and offer an alternative perspective to classify industrial activities. The evidence is interesting because it allows us to understand what firms do in a more penetrating and updated way than by referring to SIC codes. Keywords are generated from firms' descriptive texts and, for this reason, are more informative than short and static industrial classifications. Moreover, by matching firms' keywords, we were able to explore the degree of interconnection between firms, a measure from which researchers can derive industrial proximity. We were able to bring firms close to each other based on the tags they have in common (e.g., linking all firms that are specialized in 'supply chain management'). Similarly, we linked firms through the specific competencies on which they build their own specializations (e.g., connecting firms that share 'computer engineering'). In general, as is well known, such exercises are useful because they make it possible to circumscribe specific clusters of economic activities, from which firms absorb knowledge and ideas and in which they evolve and transform. Using the keywords assigned to firms helps to discover a world of existing and intangible relationships, often hidden, which represent in some ways the DNA of the economic system under observation and make it possible to capture the 'energy' behind new entrepreneurial initiatives, innovative projects, interindustry collaborations, and synergic partnerships. To borrow a metaphor proposed by [32], such an investigation of economic activities recalls the work of biologists when they look at phenotypes (the physical and functional characteristics of an organism) as expressions of genotypes (the information embodied in the DNA of an organism).
As illustrated above, in Chieti and Pescara the sectors ICT and 'Software' are at the center of the innovative economic system when looking at specializations. Using competencies, it was interesting to notice how 'Research' and 'Consulting activities', together with ICT, 'Software' and 'Internet & e-commerce', work as a trait d'union. Underlying the link between firms' activities and specializations, there are plenty of competencies such as 'Computer engineering', 'Chemistry', 'Business disciplines', 'Electrical engineering', 'Computer science', 'Electronic engineering', 'Mechanical engineering' and 'Telecommunications engineering'.
These are just a few of the insights that can emerge from the proposed method: the further the researcher is willing to go into the detail of a specific industrial area, the more precise the results will be. The proposed analysis shows some limitations. First, the performance of the exercise depends on the quality of the text data sources. This raises two considerations. The first relates to the choice of the source data and concerns, as mentioned above, the wide divergence existing between websites and corporate purposes. The second regards the variability in the breadth of the descriptions of business activities between one firm and another: some firms give very precise descriptions of their business, while others tend to describe themselves in an oversimplified way. Secondly, the selection of the firms to investigate might be biased by the scarce information provided by companies' websites and/or corporate purposes. This is crucial in those cases in which the researcher is interested in defining a precise target of firms. Thirdly, there is the question of the taxonomy employed. This is never complete and is often the result of a combination of several taxonomies, as illustrated above, each characterized by some limitations in breadth and/or in depth. In this sense, the diffusion of more empirical exercises aimed at describing firms' activities, as well as the introduction of new databases and/or the enrichment of existing libraries, can contribute to making the classification a more robust and less questionable step.
The proposed exercise also suffers from a major limitation. SIC codes are useful because they classify firms over the years, going back in time. The text data approach, by contrast, would not allow for the identification of a firm's sector for each year in the last five or 15 years, unless the firms' descriptions have been captured over time and the information archived, keeping the extraction methodology and the linking taxonomy intact.
One point remains robust: the breadth and depth of the information processed and the chance of building numerous graphs to relate firms (and groups of firms) in a very precise way. The analysis provides policymakers with a detailed and comprehensive picture of the innovative trajectories underlying the industrial structure. Some of the questions that the above results help answer are: Which firms (are similar or) have a specific specialization or competence? How common is that specialization or competence in the observed group of firms? What competencies link firms active in apparently distant industries?